Distributed-Memory Computing With 
the Langley Aerothermodynamic 
Upwind Relaxation Algorithm (LAURA) 

Christopher J. Riley* and F. McNeil Cheatwood†

NASA Langley Research Center, Hampton, VA 23681 

Paper Presented at the 

4th NASA National Symposium on Large-Scale Analysis and Design on 
High-Performance Computers and Workstations 
Oct. 15-17, 1997/Williamsburg, VA 

The Langley Aerothermodynamic Upwind Relaxation Algorithm (LAURA), a Navier- 
Stokes solver, has been modified for use in a parallel, distributed-memory environment 
using the Message-Passing Interface (MPI) standard. A standard domain decomposition
strategy is used in which the computational domain is divided into subdomains with 
each subdomain assigned to a processor. Performance is examined on dedicated parallel 
machines and a network of desktop workstations. The effect of domain decomposition 
and frequency of boundary updates on performance and convergence is also examined for 
several realistic configurations and conditions typical of large-scale computational fluid 
dynamic analysis. 


Introduction 

The design of an aerospace vehicle for space trans- 
portation and exploration requires knowledge of the 
aerodynamic forces and heating along its trajectory. 
Experiments (both ground-test and flight) and compu- 
tational fluid dynamic (CFD) solutions are currently 
used to provide this information. At high-altitude,
high-velocity conditions that are characteristic of at- 
mospheric reentry, CFD contributes significantly to 
the design because of the ability to duplicate flight 
conditions and to model high temperature effects. Un- 
fortunately, CFD solutions of the hypersonic, viscous, 
reacting-gas flow over a complete vehicle are both CPU 
time and memory intensive even on the most power- 
ful supercomputers; hence, the design role of CFD is 
generally limited to a few solutions along a vehicle’s 
trajectory. 

One CFD code that has been used extensively for 
the computation of hypersonic, viscous, reacting-gas 
flows over reentry vehicles is the Langley Aerothermo- 
dynamic Upwind Relaxation Algorithm (LAURA). 1,2 
LAURA has been used in the past to provide aerother- 
modynamic characteristics for a number of aerospace 
vehicles (e.g. AFE, 3 HL-20, 4 Shuttle Orbiter, 5 Mars 
Pathfinder, 6 SSTO Access to Space 7) and is currently
being used in the design and evaluation of blunt aero- 
braking configurations used in planetary exploration 
missions 8,9 and Reusable Launch Vehicle (RLV) con- 
cepts (e.g. the X-33 10,11 and X-34 12 programs). Although
the LAURA computer code is continually being updated
with new capabilities, it is a mature piece
of software with numerous options and utilities that 
allow the user to tailor the code to a particular appli- 
cation. 13 

LAURA was originally developed and tuned for mul- 
tiprocessor, vector computers with shared memory 
such as the CRAY C-90. Parallelism using LAURA is 
achieved through the use of macrotasking where large 
sections of code are executed in parallel on multiple 
processors. Because LAURA employs a point-implicit 
relaxation strategy that is free to use the latest avail- 
able data from neighboring cells, the solution may 
evolve without the need to synchronize tasks. This 
results in a very efficient use of the multitasking capa- 
bilities of the supercomputer. 14 But future supercom- 
puting may be performed on clusters of less powerful 
machines that offer a better price per performance 
than current large-scale vector systems. Parallel com- 
puters such as the IBM SP2 consist of large numbers of 
workstation-class processors with memory distributed 
among the processors instead of being shared. In 
addition, improvements in workstation processor and 
network speed and the availability of message-passing 
libraries allow networks of desktop workstations (that 
may sit idle during non-work hours) to be used for 
practical parallel computations. 15 As a result, many 
CFD codes are making the transition from serial to 
parallel computing. 16-20 The current shared-memory, 
macrotasking version of LAURA requires modifica- 
tion before exploiting these distributed-memory par- 
allel computers and workstation clusters. 

Several issues need to be addressed in creating a
distributed-memory version of LAURA: 1) There is
the choice of programming paradigm to use. A domain 
decomposition strategy 17 (which involves dividing the 
computational domain into subdomains and assigning 
each to a processor) is a popular approach to massively 
parallel processing and is chosen due to its similarity 
to the current macrotasking version. 2) To mini- 
mize memory requirements, the current data struc- 
ture of the macrotasking, shared-memory version is 
changed since each processor requires storage only for 
its own subdomain. 3) The choice of message-passing 
library (which processors use to explicitly exchange in- 
formation) may impact portability and performance. 
4) The frequency of boundary data exchanges be- 
tween computational subdomains can influence (and 
may impede) convergence of a solution although the 
point-implicit nature of LAURA already allows asyn- 
chronous relaxation. 14 5) There are also portabil- 
ity and performance concerns involved in designing a 
version of LAURA to run on different (cache-based 
and vector) architectures. 6) Finally, a distributed- 
memory, message-passing version of LAURA should 
retain all of the functionality, capabilities, utilities, 
and ease of use of the current shared-memory version. 

This paper describes the modifications to LAURA 
that permit its use in a parallel, distributed-memory 
environment using the Message-Passing Interface 
(MPI) standard. 21 An earlier, elementary version of 
LAURA for perfect gas flows using the Parallel Vir- 
tual Machine (PVM) library 22 provides a guide for the 
current modifications. 23 Performance of the modified 
version of LAURA is examined on dedicated paral- 
lel machines (e.g. IBM SP2, SGI Origin 2000, SGI 
multiprocessor) as well as on a network of worksta- 
tions (e.g. SGI R10000). Also, the effect of domain 
decomposition and frequency of boundary updates on 
performance and convergence is examined for several 
realistic configurations and conditions typical of large- 
scale CFD analysis. 

LAURA 

LAURA is a finite-volume, shock-capturing algorithm
for the steady-state solution of inviscid or vis-
cous, hypersonic flows on rectangularly ordered, struc- 
tured grids. The upwind-biased inviscid flux is con- 
structed using Roe’s flux-difference splitting 24 and 
Harten’s entropy fix 25 with second-order corrections 
based on Yee’s symmetric total-variation-diminishing 
scheme (TVD). 26 Gas chemistry options include per- 
fect gas, equilibrium air, and air in chemical and ther- 
mal nonequilibrium. More details of the algorithm can 
be found in Refs. 1, 2 and 13. 

The point-implicit relaxation strategy is obtained 
by treating the variables at the local cell center L at 
the advanced iteration level and using the latest avail- 
able data from neighboring cells. Thus, the governing

Fig. 1 Domain decomposition of macrotasking
version. 

relaxation equation is 

$$M_L \, \delta q_L = r_L \qquad (1)$$

where $M_L$ is the n x n point-implicit Jacobian, $q_L$
is the vector of conserved variables, $r_L$ is the residual
vector, and n is the number of unknown variables.
For a perfect gas and equilibrium air, n is equal to 5.
For nonequilibrium chemistry, n is equal to 4 plus the
number of constituent species. The residual vector $r_L$
and the Jacobian $M_L$ are evaluated using the latest
available data. The change in conserved variables, $\delta q_L$,
may be calculated using Gaussian elimination. An LU
factorization of the Jacobian can be saved (frozen) over
large blocks of iterations (≈ 10 to 50) to reduce
computational costs as the solution converges. However,
the Jacobian will need to be updated every iteration
early in the computation when the solution is changing
rapidly.
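The frozen-Jacobian idea can be illustrated with a minimal sketch. The example below is not taken from LAURA: it applies the update of Eq. (1) to a hypothetical two-variable model problem at a single "cell", with the function names (`lu_factor`, `lu_solve`, `relax`) and the model residual chosen purely for illustration. The LU factors are recomputed only every `njcobian` iterations and reused in between.

```python
def lu_factor(a):
    """LU-factor a square matrix (list of rows) with partial pivoting."""
    n = len(a)
    lu = [row[:] for row in a]
    piv = list(range(n))
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(lu[r][c]))
        lu[c], lu[p] = lu[p], lu[c]
        piv[c], piv[p] = piv[p], piv[c]
        for r in range(c + 1, n):
            lu[r][c] /= lu[c][c]
            for k in range(c + 1, n):
                lu[r][k] -= lu[r][c] * lu[c][k]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve using stored LU factors: forward, then back substitution."""
    n = len(b)
    y = [b[p] for p in piv]
    for r in range(n):
        for k in range(r):
            y[r] -= lu[r][k] * y[k]
    for r in reversed(range(n)):
        for k in range(r + 1, n):
            y[r] -= lu[r][k] * y[k]
        y[r] /= lu[r][r]
    return y


# Hypothetical model problem (an assumption, not a flow equation):
# residual r(q) vanishes at q = (1, 2).
def residual(q):
    return [3.0 - q[0] ** 2 - q[1], 5.0 - q[0] - q[1] ** 2]


def jacobian(q):
    return [[2.0 * q[0], 1.0], [1.0, 2.0 * q[1]]]


def relax(q, njcobian, iterations):
    """Iterate M dq = r, refactoring M only every njcobian iterations."""
    factorizations = 0
    lu = piv = None
    for it in range(iterations):
        if it % njcobian == 0:
            lu, piv = lu_factor(jacobian(q))  # refresh the frozen factors
            factorizations += 1
        dq = lu_solve(lu, piv, residual(q))
        q = [qi + dqi for qi, dqi in zip(q, dq)]
    return q, factorizations


if __name__ == "__main__":
    q, nfac = relax([1.5, 1.5], njcobian=5, iterations=40)
    print(q, nfac)
```

With `njcobian=1` the loop factors the Jacobian on every iteration; larger values trade some convergence rate per iteration for fewer factorizations, which is the trade-off described above.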

Macrotasking 

LAURA utilizes macrotasking by assigning pieces of 
the computational domain to individual tasks. First, 
the computational domain is divided into blocks, 
where a block is defined as a rectangularly ordered 
array of cells containing all or part of the solution 
domain. Then each block may be subdivided in the 
computational sweep direction into one or more par- 
titions. Partitions are then separately assigned to a 
task (processor). Figure 1 shows a two-dimensional 
(2D) domain divided into 2 blocks with each block di- 
vided into 2 partitions. Thus a task may work on one 
or more partitions which may be contained in a single 
block or may overlap several blocks. Each task then 
gathers and distributes its data to a master copy of 
the solution which resides in shared memory. With 
the point-implicit relaxation, there is no need to syn- 
chronize tasks which results in a very efficient parallel

Fig. 2 Domain decomposition of message-passing
version. 

implementation. 

Message-passing

In the new message-passing version of LAURA, the 
computational domain is again subdivided into blocks 
along any of the three (i, j, k) coordinate directions
with each block assigned to a processor. As com- 
pared to the macrotasking version, this is analogous 
to defining each block to contain only one partition 
and assigning each partition to a separate task. The 
number of blocks is therefore equal to the total num- 
ber of processors. Due to the distributed memory of 
the processors, each task requires storage only for
its own block plus storage for boundary data from as 
many as six neighboring blocks (i.e. one for each of the 
six block faces). Figure 2 shows a 2D domain divided 
equally into 4 separate blocks. Each processor works 
only on its own block and pauses at user-specified in- 
tervals to exchange boundary data with its neighbors. 
The boundary data exchange is explicitly handled with 
send and receive calls from the MPI message-passing 
library. 21 The MPI library was chosen because it is 
a standard and because there are multiple implemen- 
tations that run on workstations as well as dedicated 
parallel machines. 27,28 Synchronization of tasks occurs
when messages are exchanged, but this exchange
is not required for any particular iteration due to the 
point-implicit relaxation scheme. As in the macro- 
tasking version, tasks (or blocks) of various sizes may 
accumulate differing numbers of iterations during a 
run. For blocks of equal size, it may be convenient 
to synchronize the message exchange at specified iter- 
ation intervals. 
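The role of the exchange interval can be illustrated without any message-passing library. The sketch below is a serial toy model, not LAURA and not MPI: a one-dimensional Laplace problem is split into blocks, each block relaxes its own cells using the latest local data, and the ghost values copied from neighboring blocks are refreshed only every `nexch` iterations. The function name, the model problem, and the tolerance are all assumptions made for illustration.

```python
def iterations_to_converge(n_cells=32, n_blocks=4, nexch=1,
                           tol=1e-6, max_iter=100000):
    """Count relaxation sweeps until the global residual drops below tol."""
    size = n_cells // n_blocks
    # Each block stores its interior cells plus one ghost cell at each end.
    blocks = [[0.0] * (size + 2) for _ in range(n_blocks)]
    blocks[-1][-1] = 1.0  # right physical boundary; left boundary is 0.0
    for it in range(1, max_iter + 1):
        # Point relaxation inside each block using the latest local data.
        for u in blocks:
            for i in range(1, size + 1):
                u[i] = 0.5 * (u[i - 1] + u[i + 1])
        # Boundary data "exchange" at the user-specified interval.
        if it % nexch == 0:
            for left, right in zip(blocks, blocks[1:]):
                left[-1] = right[1]   # ghost <- neighbor's first interior cell
                right[0] = left[-2]   # ghost <- neighbor's last interior cell
        # Residual of the assembled global solution (true convergence check).
        g = [0.0] + [v for u in blocks for v in u[1:-1]] + [1.0]
        res = max(abs(g[i] - 0.5 * (g[i - 1] + g[i + 1]))
                  for i in range(1, n_cells + 1))
        if res < tol:
            return it
    return max_iter


if __name__ == "__main__":
    for nexch in (1, 20):
        print(nexch, iterations_to_converge(nexch=nexch))
```

In this toy problem, lagging the boundary data increases the number of sweeps needed. The size of the penalty is specific to the model equation and should not be read as representative of the flow solutions in this paper.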

Results 

The performance of the distributed-memory, 
message-passing version of LAURA is examined 


in terms of computational speed and convergence. 
Measuring the elapsed wall clock time of the code on 
different machines estimates the communication over- 
head and message-passing efficiency of the code. The 
communication overhead associated with exchanging 
boundary data information between nodes depends 
on the parallel machine, the size of the problem, and 
the frequency of exchanges. The frequency of data 
exchanges may be decreased if necessary to reduce the 
communication penalty, but this may adversely affect 
convergence. Therefore, the impact of boundary data 
exchange frequency on convergence is determined for 
several realistic vehicles and flow conditions. 

Computational Speed 

Timing estimates using the message-passing version 
of LAURA are presented for an IBM SP2, an SGI 
Origin 2000, an SGI multiprocessor machine, and a 
network of SGI R10000 workstations. The single-node 
performance of LAURA on a cache-based (as opposed 
to vector) architecture is not addressed. Viscous, per- 
fect gas computations are performed on the forebody 
of an X-33 10, 11 configuration with a grid size of 64 x 56 
x 64. The computational domain is split along each of 
the coordinate directions (depending on the number of 
nodes) into blocks of equal size. The individual block 
sizes are shown below. Because the blocks are equal in 

Table 1 Block sizes for timing study.

size, boundary data exchanges are synchronized at a
specified iteration interval for convenience. Each run 
begins with a partially converged solution and is run 
for 200 iterations with second order accuracy. Two 
values (1 and 20) are used for nexch, the number of
iterations between boundary data exchanges, to esti- 
mate the communication overhead on each machine. 
The number of iterations that the Jacobian is held 
fixed, njcobian, is equal to 20 and represents a typical 
value for solutions that are partially converged. 

Four different architectures are used to obtain tim- 
ing estimates. The first is a 160-node IBM SP2 located 
at the Numerical Aerospace Simulation (NAS) Facility 
at NASA Ames using IBM’s implementation of MPI. 
The second is a 64 processor (R10000) SGI Origin 2000 
also located at NAS using SGI’s version of MPI. The 
third is a 12 processor (R10000) SGI machine oper- 
ating in a multiuser environment, and the fourth is 
a network of SGI R10000 workstations connected by

Fig. 3 Elapsed wall clock time on IBM SP2.

Fig. 4 Elapsed wall clock time on SGI Origin 2000.

Ethernet. Both of these SGI machines use the MPICH 
implementation 27 of MPI. On all architectures, the 
MPI-defined timer, MPI_WTIME, is used to measure
elapsed wall clock time for the main algorithm only. 
The time to read and write restart files and to perform 
pre- and post-processing is not measured although it 
may account for a significant fraction of the total time. 
Compiler options include ‘-O3 -qarch=pwr2’ on the
IBM SP2 and ‘-O2 -n32 -mips4’ on the SGI machines.
No effort is made to optimize the single-node perfor- 
mance of LAURA on these cache-based architectures. 

Figures 3-6 display the elapsed wall clock times 
on the various machines. A time based on the 
single-node time and assuming a linear speedup equal 
to the number of nodes is shown for comparison. The 
measured times are less than the comparison time for 
most of the cases as a result of the smaller blocks 
on each node making better use of the cache. This 
increase in cache performance offsets the communica- 
tion penalty. Improving the single-node performance 
of LAURA on these cache-based architectures would 
reduce the single-node times and give a more accurate

Fig. 5 Elapsed wall clock time on SGI multiprocessor.

Fig. 6 Elapsed wall clock time on network of SGI R10000 workstations.

measure of the communication overhead. Nevertheless,
the speedup of the code on all machines is
good. As anticipated, the relative message-passing 
performance on the dedicated machines (IBM SP2, 
SGI Origin 2000, SGI multiprocessor) is better than 
on the network of SGI workstations. Also, the per- 
formance with data exchanged every 20 iterations is 
noticeably better on the network of workstations than 
with data exchanged every iteration. However, there 
is little influence of nexch on elapsed time on the 
dedicated machines which indicates that the commu- 
nication overhead is very low. The degradation in 
performance of the 8 processor runs on the SGI mul- 
tiprocessor is due to the load on the machine from 
other users and is not a result of the communication 
overhead. Of course, the times (and message-passing 
efficiency) measured will vary depending on machine 
and problem size. Also shown in Fig. 3 is the elapsed 
time from a multitasking run with the original version 
of LAURA on a CRAY C-90 using 9 CPUs. This

Fig. 7 Vehicle geometries.

shows that performance comparable to current vector 
supercomputers may be obtained on dedicated paral- 
lel machines (albeit with more processors) using this 
distributed-memory version of LAURA. 
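The linear-speedup comparison time used in Figs. 3-6 is the single-node time divided by the number of nodes. A short sketch of that arithmetic follows; the function names and the timings in the usage example are hypothetical and are not measurements from this study.

```python
def ideal_time(single_node_time, nodes):
    """Comparison time assuming linear speedup equal to the node count."""
    return single_node_time / nodes


def speedup(single_node_time, measured_time):
    """Ratio of single-node time to measured parallel time."""
    return single_node_time / measured_time


def efficiency(single_node_time, measured_time, nodes):
    """Speedup per node; values above 1 indicate superlinear speedup."""
    return speedup(single_node_time, measured_time) / nodes


if __name__ == "__main__":
    # Hypothetical timings (seconds), for illustration only.
    t1, t16 = 1600.0, 90.0
    print(ideal_time(t1, 16))      # comparison time for 16 nodes
    print(speedup(t1, t16))
    print(efficiency(t1, t16, 16))
```

A measured time below the comparison time corresponds to an efficiency greater than 1, which is the situation attributed above to better cache use by the smaller blocks.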

Convergence 

The effect of problem size, gas chemistry, and 
boundary data exchange frequency on convergence is 
examined for four realistic geometries: the X-33 10,11 
and X-34 12 RLV concepts, the X-33 forebody, and 
the Stardust sample return capsule forebody. 8 All 
four geometries are shown in Fig. 7. A viscous (thin- 
layer Navier-Stokes), perfect gas solution is computed 
over the X-33 and X-33 forebody configurations. The
convergence of an inviscid, perfect gas solution is ex- 
amined using the X-34 vehicle. Nonequilibrium air 
chemistry effects on the convergence and performance 
of the distributed-memory version of LAURA are de- 
termined from a viscous, 11-species air calculation over 
the Stardust capsule. For all geometries, the vehicle is 
defined by the k = 1 surface, and the outer boundary 
of the volume grid is defined by k = kmax. 

Each viscous solution is computed with the same 
sequence of parameters for consistency and is started 
with all flow-field variables initially set to their 
freestream values. Slightly different values are used 
for the inviscid solutions due to low densities on the 
leeside of the vehicle causing some instability when 
switching from first to second order accuracy. Methods 
to speed convergence such as computing on a coarse 
grid before proceeding to the fine grid and converging 
blocks sequentially beginning at the nose (i.e. block 
marching) 5 are not used. The relevant LAURA param- 
eters are shown below. Two values of nexch are used 
(except for the run involving the complete X-33 configuration).

Table 2 LAURA parameters - viscous.

Iterations    Order    njcobian
0-100         1        1
101-300       1        2
301-500       2        10
> 500         2        20

Table 3 LAURA parameters - inviscid.

Iterations    Order    njcobian
0-100         1        1
101-300       1        2
301-900       1        10
901-1100      2        10
> 1100        2        20

A baseline solution is generated with nexch
equal to 1. Updating the boundary data every itera- 
tion should mimic the communication between blocks 
in the shared-memory version of LAURA. A second 
computation is made with nexch equal to njcobian 
since acceptable values for both parameters depend 
on transients in the flow. Solutions that are chang- 
ing rapidly should update the Jacobian and exchange 
boundary data frequently, while partially converged 
solutions may be able to freeze the Jacobian and lag 
the boundary data for a number of iterations. A simple 
strategy is to link the two parameters. Convergence is 
measured by the $L_2$ norm defined by

$$L_2 = \frac{1}{C_N^2} \sum_{L=1}^{N} \frac{r_L \cdot r_L}{\rho_L^2} \qquad (2)$$

where $C_N$ is the Courant number, N is the total number
of cells, $r_L$ is the residual vector, and $\rho_L$ is the
local density. All solutions are generated on the IBM
SP2.
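The strategy of linking nexch to njcobian can be written as a simple lookup on the iteration count. The sketch below encodes the schedules of Tables 2 and 3 as read here (the first-row values are taken to be first order with njcobian = 1); the data layout and the function name `parameters` are illustrative assumptions, not part of LAURA's input format.

```python
# Each entry: (last iteration of the range or None for open-ended,
#              order of accuracy, njcobian).
VISCOUS = [(100, 1, 1), (300, 1, 2), (500, 2, 10), (None, 2, 20)]
INVISCID = [(100, 1, 1), (300, 1, 2), (900, 1, 10),
            (1100, 2, 10), (None, 2, 20)]


def parameters(iteration, schedule, link_nexch=True):
    """Return (order, njcobian, nexch) for a given iteration number.

    With link_nexch=True, nexch follows njcobian; otherwise nexch = 1,
    which corresponds to the baseline case.
    """
    for last, order, njcobian in schedule:
        if last is None or iteration <= last:
            nexch = njcobian if link_nexch else 1
            return order, njcobian, nexch
    raise ValueError("schedule has no open-ended final range")


if __name__ == "__main__":
    print(parameters(250, VISCOUS))                     # (1, 2, 2)
    print(parameters(600, VISCOUS))                     # (2, 20, 20)
    print(parameters(600, VISCOUS, link_nexch=False))   # (2, 20, 1)
    print(parameters(1000, INVISCID))                   # (2, 10, 10)
```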

X-33 

The viscous, perfect gas flow field is computed over 
the X-33 RLV configuration (without the wake) to 
demonstrate the ability of the new message-passing 
version of LAURA to handle large-scale problems in 
a reasonable amount of time. The freestream Mach 
number is 9.2, the angle of attack is 18.1 deg, and the 
altitude is 48.3 km. The grid size is 192 x 168 x 64 
and is divided into 64 blocks of 48 x 42 x 16. 

Figure 8 shows the $L_2$ convergence as a function of
number of iterations. The elapsed wall clock time on 
the SP2 is 12.7 hr. Only the baseline case (nexch = 1)
was computed due to resource limitations. The effect 
of nexch on convergence will be examined in greater 
detail on the nose region of this vehicle. The stall 
in convergence after 10000 iterations is due to a limit 
cycle in the convergence at the trailing edge of the 
tip of the canted fin. Iterations are continued past

Fig. 8 Convergence history of viscous, perfect gas
flow field over X-33 vehicle. 

this point to converge the boundary layer and surface 
heating. 

X-33 forebody 

The effects of boundary data exchange frequency 
and block splitting on convergence are evaluated for 
the nose section of the X-33. This is the same con- 
figuration used to obtain the timing estimates, and 
freestream conditions correspond to the complete X- 
33 vehicle case. The 64 x 56 x 64 grid is first divided 
in the i-, j-, and k-directions into 16 blocks comprised
of 32 x 28 x 16 cells each. Two cases, nexch = 1 and 
nexch = njcobian, are run using this blocking. Another
case is computed with the grid divided in the i-
and j-directions only resulting in blocks of 16 x 14 x 64 
cells. Next, the asynchronous relaxation capabilities of 
LAURA are tested by reblocking a partially converged 
restart file in the k-direction to cluster work (and iterations)
in the boundary layer. Each block has i x j
dimensions of 32 x 28, but the k dimension is split into 
8, 8, 16, and 32 cells. Blocks near the wall contain 32 
x 28 x 8 cells, while blocks near the outer boundary 
have 32 x 28 x 32 cells. Thus, the smaller blocks ac- 
cumulate more iterations than the larger outer blocks 
in a given amount of time and should converge faster. 

Figure 9 shows the convergence history for this flow 
field. For viscous solutions, convergence is typically 
divided into two stages. First, the inviscid shock layer 
develops and then the majority of the iterations are 
spent converging the boundary layer (and surface heat- 
ing). Lagging the boundary data appears to have more 
of an impact on the early convergence of the inviscid 
features of the flow and less of an impact on the bound- 
ary layer convergence. This effect is much larger when 
the blocks are split in the k-direction across the shock
layer. The communication delay affects the develop- 
ing shock wave as it crosses the block boundaries in 



a) Convergence as function of number of iterations 



b) Convergence as function of time 


Fig. 9 Convergence histories of viscous, perfect 
gas flow field over X-33 forebody. 

the k-direction.

Figure 9(b) shows the convergence as a function of 
wall clock time. Because of the low communication 
overhead on the IBM SP2, the time saved by mak- 
ing fewer boundary data exchanges is small. As seen 
from the timing data, this would not necessarily be 
true on a network of workstations where the decrease 
in communication overhead might offset any increase 
in number of iterations. Also shown are LAURA's
asynchronous relaxation capabilities. After 1 hr (and 
3500 iterations), the outer inviscid layer is partially 
converged. Restructuring the block structure at this 
point by splitting the k dimension into 8, 8, 16, and 
32 cells allows the boundary layer to accumulate more 
iterations and accelerates convergence. The result is a 
15 percent decrease in wall clock time compared to the
baseline (nexch = 1) case. A similar strategy would
also have accelerated the convergence of the baseline 
case. 

X-34 

The effect of boundary data exchange frequency and 
block splitting on convergence of inviscid, perfect gas 
flows is investigated for the X-34 configuration (minus 
the body flap and vertical tail). Inviscid solutions are 
useful in predicting aerodynamic characteristics for ve- 
hicle design and may be coupled with a boundary-layer 
technique to predict surface heat transfer as well. The 
freestream Mach number is 6.32, the angle of attack is 
23 deg, and the altitude is 36 km. The grid is 120 x 
152 x 32 and is first divided into 32 blocks of 30 x 38 x 
16 cells. The grid is also split in the i- and j-directions 
into blocks of 30 x 19 x 32 cells to check the effect of 
block structure on convergence. The convergence his- 
tories are shown in Figure 10. The aerodynamics (not 
shown) of the vehicle are converged at 4000 iterations. 
The spike in convergence at 900 iterations is caused by 
the switch from first to second order accuracy. With 
the grid split in all directions, the baseline solution 
(nexch = 1) reaches an $L_2$ norm of $10^{-3}$ at 3300 iterations
while the solution with boundary data lagged
takes 3640 iterations. The solution with the grid split 
in the i- and j-directions requires 3530 iterations. As 
shown in Fig. 10(b), there is a corresponding difference 
in run times to reach that convergence level because 
the savings from fewer boundary data exchanges are 
small on the SP2. Nevertheless, the effect of lagging 
the boundary data on convergence is minimal. 
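Whether lagging the boundary data pays off in wall clock time depends on how much each iteration gets cheaper relative to how many more iterations are needed. The sketch below works through that break-even arithmetic with the iteration counts quoted above; the function names are illustrative, and the model assumes a constant average time per iteration for each run.

```python
def extra_iteration_fraction(base_iters, other_iters):
    """Fractional increase in iterations relative to the baseline."""
    return other_iters / base_iters - 1.0


def breakeven_cost_ratio(base_iters, lagged_iters):
    """Per-iteration time ratio (lagged / baseline) below which the
    lagged run finishes sooner than the baseline run."""
    return base_iters / lagged_iters


if __name__ == "__main__":
    base, lagged, ij_split = 3300, 3640, 3530
    print(extra_iteration_fraction(base, lagged))    # about 0.10
    print(extra_iteration_fraction(base, ij_split))  # about 0.07
    print(breakeven_cost_ratio(base, lagged))        # about 0.91
```

For this case the lagged solution needs roughly 10 percent more iterations, so it is faster overall only if reducing the exchanges cuts the average time per iteration by more than about 9 percent.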

Stardust 

The convergence of a nonequilibrium air (11 species, 
two temperature), viscous computation is examined 
for the forebody of the Stardust capsule. The 
freestream Mach number is 17 and the angle of at- 
tack is 10 deg. The grid is 56 x 32 x 60 and is divided 
into 32 blocks of 7 x 8 x 60 cells each. There are 
no splits in the k-direction. Figure 11 shows the convergence
as a function of iterations and elapsed wall
clock time. Because of the larger number of flow-field 
variables, considerably more data must be exchanged 
between blocks for nonequilibrium flows. Even on a 
dedicated parallel machine such as the IBM SP2, the 
communication penalty for this particular case has a 
significant impact on the elapsed time. The baseline 
case reaches an $L_2$ norm of $10^{-4}$ at 6900 iterations
compared to 7500 iterations for the nexch = njcobian 
solution. However, the savings in communication time 
allows the nexch = njcobian solution to converge 1 hr 
faster than the baseline case. 
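The growth in exchanged data with the number of flow-field variables can be estimated from the block dimensions. The sketch below is a rough count under stated assumptions: one layer of cells is exchanged per block face, every face in a split direction has a neighbor, and the number of unknowns follows the rule given earlier (5 for perfect gas, 4 plus the number of species for nonequilibrium chemistry). The function names are hypothetical.

```python
def unknowns(species=None):
    """Unknowns per cell: 5 for perfect gas or equilibrium air,
    4 plus the number of species for nonequilibrium chemistry."""
    return 5 if species is None else 4 + species


def face_values(block, n_unknowns, split_dirs):
    """Values exchanged per boundary update for one interior block,
    assuming a single cell layer on both faces of each split direction."""
    i, j, k = block
    face_cells = {"i": j * k, "j": i * k, "k": i * j}
    return sum(2 * face_cells[d] for d in split_dirs) * n_unknowns


if __name__ == "__main__":
    block = (7, 8, 60)  # Stardust block, split in i and j only
    perfect = face_values(block, unknowns(), "ij")
    noneq = face_values(block, unknowns(species=11), "ij")
    print(perfect, noneq, noneq / perfect)   # 9000 27000 3.0
```

Under these assumptions an 11-species computation exchanges three times as many values per update as a perfect gas computation on the same blocks.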

Conclusions 

The shared-memory, multitasking version of the 
CFD code LAURA has been successfully modified to 



a) Convergence as function of number of iterations 



b) Convergence as function of time 


Fig. 10 Convergence histories of inviscid, perfect 
gas flow field over X-34 vehicle. 

take advantage of distributed-memory parallel ma- 
chines. A standard domain decomposition strategy 
yields good speedup on dedicated parallel systems, 
but the single-node performance of LAURA on cache- 
based architectures requires further study. The point- 
implicit relaxation strategy in LAURA is well-suited 
for parallel computing and allows the communication 
overhead to be minimized (if necessary) by reducing 
the frequency of boundary data exchanges. The com- 
munication overhead is greatest on the network of 
workstations and for nonequilibrium flows due to more 
data passing between nodes. Lagging the boundary 
data between blocks appears to affect the development 
of the inviscid shock layer more than the convergence 
of the boundary layer. Its largest effect occurs when

a) Convergence as function of number of iterations



b) Convergence as function of time 


Fig. 11 Convergence histories of viscous, nonequi- 
librium flow field over Stardust capsule. 


the grid is split in the direction normal to the vehicle 
surface. However, restructuring the blocks to cluster 
work and iterations in the boundary layer improves 
overall convergence once the inviscid features of the 
flow have developed. These results demonstrate the 
ability of the new message-passing version of LAURA 
to effectively use distributed-memory parallel systems 
for realistic configurations. As a result, the effective- 
ness of LAURA as an aerospace design tool is enhanced 
by its new parallel computing capabilities. In fact, this 
new version of LAURA is currently being applied to 
the evaluation of vehicles used in planetary exploration 
missions and the X-33 program. 
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